CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites

Background It is estimated that up to 50% of all disease causing variants disrupt splicing. Due to its complexity, our ability to predict which variants disrupt splicing is limited, meaning missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve prediction of splice disrupting variants. The recently published SpliceAI algorithm utilises deep neural networks and has been reported to have a greater accuracy than other commonly used methods. Methods and findings The original SpliceAI was trained on splice sites included in primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and de-correlate the machine learning input with its output. Limiting the data to only validated and manual annotated primary and alternatively spliced GENCODE sites in training may improve predictive abilities. All of these gene isoforms were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained (CI-SpliceAI). Predictive performance on a newly curated dataset of 1,316 functionally validated variants from the literature was compared with the original SpliceAI, alongside MMSplice, MaxEntScan, and SQUIRLS. Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ∼91%, and CI-SpliceAI showing an improvement at ∼92% overall. Predictive accuracy increased in the majority of curated variants. Conclusions We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and highlight avenues for further performance improvements.


S2 Appendix. Scraping and Quality Control of Variant Data
Incorporating Wai et al. Table S1 of [1] consists of 258 (actually 259) variants across 65 genes. There is a duplicate HGVS ID (variants 32/33), where apparently there was a copy-and-paste error. In the original publication, variant 32 was incorrectly called NM 007294.3:c.5024C>T (duplicating the entry below) and with the authors help the ID was corrected to NM 007294.3:c.5074+7C>T. Variant 220 is really two; the authors could not determine which variant was causing the effect, so both variants were removed.
The splicing annotation from the source was changed to a binary form ("Normal"/everything else). After parsing the RefSeq ID to genomic coordinates, 12 variant locations were found to be offset by 1bp. This was rectified by fetching all genomic coordinates (see S2 Appendix).
The publication [2] contains table S1 with an aggregation of 272 (269 really since 3 were not disclosed) variants, 124 new ones, 50 previously published variants across 45 publications, and 98 without attribution.
A number of HGVS IDs were not recognised by Ensembl VEP, so were manually amended. Furthermore, only data points where the RT-PCR outcome indicated a conclusive splicing disruption were included.
NM 001040656.1 was deprecated by NCBI, NM 001077416 is not supported by ensembl VEP, both variants were removed.
Incorporating Leman et al.
Tables S1-S3 from [3] were used, containing a total of 254 variants (141 breast cancer variants of their own, the rest compiled from 66 publications) across 11 genes. NM 007294.3:c.133 136del is an invalid ID that could not be manually resolved as it's unclear if this is a single nucleotide deletion or if it's removing a range of nucleotides. Transcript/Variant annotations were used to generate HGVS IDs, and the Splicing Effect field was used as ground truth.
65 of the variants are published as a HTML table and 207 variants on a PDF table across 17 pages. Annotations from HTML were extracted through copy-and-paste into Microsoft Excel, the PDF table was parsed using Tabula [5], followed by manual correction of OCR issues. 12 annotations where the outcome was not obvious were removed, only retaining entries tagged as acceptor/donor loss/gain / skipping / retention. One variant had no annotated observation (NM 000059.3:c.7056T>A), which was also removed. Some IDs contained recurrence annotations in their ID, which were removed as that was syntactically invalid. The remaining 267 variants were extracted. Following their paper, variants with an annotated P-Value < 0.01 were annotated as splice affecting; the remainder as non-affecting.
Incorporating Ellingford et al. Table 1 of [7] published 21 variants and their functional assessment of splice disruption which have been extracted manually.
Incorporating MutSpliceDB MutSpliceDB [8] is a freely accessible genome database consisting of, at the time of writing, 86 variants and their effects on splicing. All variants are disruptive. Variants were exported using their web interface.
Extraction of Genomic Coordinates GRCh38 coordinates for the HGVS IDs were fetched automatically using ensembl VEP [9].
Errors returned by this service were resolved manually by querying the RefSeq ID in NCBI Nucleotide [10] that will link to the latest RefSeq transcript version. For some IDs this still failed, in which the version was removed completely. If both strategies failed, manual investigation revealed some faults that could be rectified (missing colons, mangled protein annotations, missing right-bounds); the remainder of incorrect HGVS IDs (mismatching reference annotations, deprecated transcripts, missing variant annotations, or ambivalent range annotations) were dropped.
S3 Appendix. Calculation of the Delta Score and indel compensation.
The difference between predictions for canonical and alternative sequences build the δ-score, which in most cases translates into subtracting the two predictive matrices directly, i.e. δ = P v − P r . This however does not work for indels, where the shape of the predictive matrices differ and therefore cannot be subtracted directly. Following [11], the variant annotations are changed by either padding deletions or truncating insertions using a max function (P v , Fig 8). This method prevents offsets by aligning indels back to the reference genome.
Given the delta matrix, the delta position (DP) and delta score (DS) for the most significant events for the events acceptor gain (AG), acceptor loss (AL), donor gain (DG) and donor loss (DL) are commonly extracted (see last step in Fig 8). For those data points, where the exact effect and position of the variant is known, the significance and position is compared to the annotations in the exact classification task How the delta score is calculated. The blue rectangle indicates an indel event where the output matrices P r and P v for reference and variant predictions do not align and need to be compensated. P v is the re-aligned variant matrix, which either pads deletion predictions with zeros or truncates insertion predictions with a max function. The delta score can then be calculated by subtraction. SpliceAI annotations return the delta score (DS) and delta position (DP) for the maximum and minimum values on the acceptor and donor row, quantifying and locating the events acceptor gain/loss (AG/AL) and donor gain/loss (DG/DL).